A Brief History of Just-In-Time 

JOHN AYCOCK 

University of Calgary 


Software systems have been using “just-in-time” compilation (JIT) techniques since the 
1960s. Broadly, JIT compilation includes any translation performed dynamically, after a 
program has started execution. We examine the motivation behind JIT compilation and 
constraints imposed on JIT compilation systems, and present a classification scheme for 
such systems. This classification emerges as we survey forty years of JIT work, from 
1960-2000.

1. INTRODUCTION

Those who cannot remember the past are con- 
demned to repeat it. 

George Santayana, 1863-1952 [Bartlett 1992] 

This oft-quoted line is all too applicable 
in computer science. Ideas are generated, 
explored, set aside — only to be reinvented 
years later. Such is the case with what 
is now called “just-in-time” (JIT) or dy- 
namic compilation, which refers to trans- 
lation that occurs after a program begins 
execution. 

Strictly speaking, JIT compilation sys- 
tems (“JIT systems” for short) are com- 
pletely unnecessary. They are only a 
means to improve the time and space ef- 
ficiency of programs. After all, the central 
problem JIT systems address is a solved 
one: translating programming languages
into a form that is executable on a target
platform.

What is translated? The scope and na- 
ture of programming languages that re- 
quire translation into executable form 
covers a wide spectrum. Traditional pro- 
gramming languages like Ada, C, and 
Java are included, as well as little lan- 
guages [Bentley 1988] such as regular 
expressions. 

Traditionally, there are two approaches 
to translation: compilation and interpreta- 
tion. Compilation translates one language 
into another — C to assembly language, for 
example — with the implication that the 
translated form will be more amenable 
to later execution, possibly after further 
compilation stages. Interpretation elimi- 
nates these intermediate steps, perform- 
ing the same analyses as compilation, but 
performing execution immediately.

JIT compilation is used to gain the benefits of both (static)
compilation and interpretation. These benefits will be brought
out in later sections, so we only summarize them here:

— Compiled programs run faster, espe- 
cially if they are compiled into a form 
that is directly executable on the under- 
lying hardware. Static compilation can 
also devote an arbitrary amount of time 
to program analysis and optimization. 
This brings us to the primary constraint 
on JIT systems: speed. A JIT system 
must not cause untoward pauses in nor- 
mal program execution as a result of its 
operation. 

— Interpreted programs are typically 
smaller, if only because the represen- 
tation chosen is at a higher level than 
machine code, and can carry much more 
semantic information implicitly. 

— Interpreted programs tend to be 
more portable. Assuming a machine- 
independent representation, such as 
high-level source code or virtual ma- 
chine code, only the interpreter need be 
supplied to run the program on a differ- 
ent machine. (Of course, the program 
still may be doing nonportable opera- 
tions, but that’s a different matter.) 

— Interpreters have access to run-time 
information, such as input parame- 
ters, control flow, and target machine 
specifics. This information may change 
from run to run or be unobtainable 
prior to run-time. Additionally, gather- 
ing some types of information about a 
program before it runs may involve al- 
gorithms which are undecidable using 
static analysis. 

To narrow our focus somewhat, we 
only examine software-based JIT systems 
that have a nontrivial translation aspect. 
Keppel et al. [1991] eloquently built an ar- 
gument for the more general case of run- 
time code generation, where this latter re- 
striction is removed. 

Note that we use the term execution in a broad sense: we call a
program representation executable if it can be executed by the
JIT system in any manner, either directly as in machine code, or
indirectly using an interpreter.

2. JIT COMPILATION TECHNIQUES 

Work on JIT compilation techniques often focuses on the
implementation of a particular programming language. We have
followed this same division in this section, ordering from
earliest to latest where possible.

2.1. Genesis 

Self-modifying code has existed since the 
earliest days of computing, but we exclude 
that from consideration because there is 
typically no compilation or translation as- 
pect involved. 

Instead, we suspect that the earliest 
published work on JIT compilation was 
McCarthy’s [1960] LISP paper. He men- 
tioned compilation of functions into ma- 
chine language, a process fast enough that 
the compiler’s output needn’t be saved. 
This can be seen as an inevitable result of 
having programs and data share the same 
notation [McCarthy 1981]. 

Another early published reference to 
JIT compilation dates back to 1966. The 
University of Michigan Executive System 
for the IBM 7090 explicitly notes that the 
assembler [University of Michigan 1966b, 
p. 1] and loader [University of Michigan 
1966a, p. 6] can be used to translate and 
load during execution. (The manual’s pref- 
ace says that most sections were written 
before August 1965, so this likely dates 
back further.) 

Thompson’s [1968] paper, published in 
Communications of the ACM, is frequently 
cited as “early work” in modern publi- 
cations. He compiled regular expressions 
into IBM 7094 code in an ad hoc fashion, 
code which was then executed to perform 
matching. 

2.2. LC²

The Language for Conversational Computing, or LC², was designed
for interactive programming [Mitchell et al. 1968]. Although used
briefly at Carnegie-Mellon University for teaching, LC² was
primarily an experimental language [Mitchell 2000]. It might
otherwise be consigned to the dustbin of history, if not for the
techniques used by Mitchell in its implementation [Mitchell 1970],
techniques that later influenced JIT systems for Smalltalk and
Self.

Fig. 1. The time-space tradeoff.

Mitchell observed that compiled code 
can be derived from an interpreter at run- 
time, simply by storing the actions per- 
formed during interpretation. This only 
works for code that has been executed, 
however — he gave the example of an if- 
then-else statement, where only the else- 
part is executed. To handle such cases, 
code is generated for the unexecuted part 
which reinvokes the interpreter should it 
ever be executed (the then-part, in the 
example above). 
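
Mitchell's observation can be illustrated with a minimal sketch. The toy statement forms ("assign" and "if"), the closure-based notion of "compiled code," and every name below are assumptions made for illustration; none of it is taken from LC² itself.

```python
# Hypothetical toy language: ("assign", name, expr) and
# ("if", test, then_part, else_part), where expr and test are Python
# callables that take the variable environment.
stats = {"interpreted": 0}   # statements handled by the interpreter so far


def interpret(stmt, env):
    """Interpret one statement and return the compiled code recorded for it."""
    stats["interpreted"] += 1
    kind = stmt[0]
    if kind == "assign":
        _, name, expr = stmt

        def assign_code(e):
            e[name] = expr(e)

        assign_code(env)             # perform the action now ...
        return assign_code           # ... and keep it as compiled code
    if kind == "if":
        _, test, then_part, else_part = stmt
        parts = {True: then_part, False: else_part}
        compiled = {}                # branch -> recorded code, filled as branches run

        def if_code(e):
            branch = bool(test(e))
            if branch in compiled:
                execute(compiled[branch], e)
            else:
                # Stub for a branch that has never run: reinvoke the interpreter,
                # which executes the branch and records code for next time.
                compiled[branch] = interpret_block(parts[branch], e)

        if_code(env)
        return if_code
    raise ValueError(f"unknown statement kind: {kind!r}")


def interpret_block(stmts, env):
    """Interpret a statement list, returning the recorded compiled code."""
    return [interpret(stmt, env) for stmt in stmts]


def execute(code, env):
    """Run previously recorded code without consulting the interpreter."""
    for op in code:
        op(env)


program = [("if", lambda e: e["x"] > 0,
            [("assign", "y", lambda e: 1)],
            [("assign", "y", lambda e: -1)])]

env = {"x": -5}
code = interpret_block(program, env)   # only the else-part is executed and recorded
```

Running `execute(code, {"x": 2})` afterwards takes the stub path for the then-part once; later runs with a positive `x` use the recorded code directly.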

2.3. APL 

The seminal work on efficient APL 
implementation is Abrams’ disserta- 
tion [Abrams 1970]. Abrams concocted 
two key APL optimization strategies, 
which he described using the connotative 
terms drag-along and beating. Drag-along
defers expression evaluation as long as 
possible, gathering context information in 
the hopes that a more efficient evaluation 
method might become apparent; this 
might now be called lazy evaluation. 
Beating is the transformation of code to 
reduce the amount of data manipulation 
involved during expression evaluation. 
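
The two strategies can be sketched in a few lines, under loudly stated assumptions: the `Deferred` class, its methods, and the addition counter are hypothetical and bear no relation to Abrams' actual machine design. Addition is deferred (drag-along), while reversal and "take" only remap indices instead of moving data (one reading of beating).

```python
stats = {"additions": 0}    # counts element additions actually performed


class Deferred:
    """An array expression whose elements are computed only when forced."""

    def __init__(self, length, element):
        self.length = length
        self.element = element          # function mapping an index to a value

    @classmethod
    def of(cls, values):
        data = list(values)
        return cls(len(data), lambda i: data[i])

    def __add__(self, other):
        # Drag-along: remember the addition instead of performing it.
        def element(i):
            stats["additions"] += 1
            return self.element(i) + other.element(i)
        return Deferred(min(self.length, other.length), element)

    def reverse(self):
        # Beating: remap indices rather than moving any data.
        n = self.length
        return Deferred(n, lambda i: self.element(n - 1 - i))

    def take(self, k):
        # Beating: shrink the described length; no elements are copied.
        return Deferred(min(k, self.length), self.element)

    def force(self):
        """Evaluate the deferred expression, touching only the needed elements."""
        return [self.element(i) for i in range(self.length)]


a = Deferred.of([1, 2, 3, 4, 5])
b = Deferred.of([10, 20, 30, 40, 50])
result = (a + b).reverse().take(2).force()   # first two elements of the reversed sum
```

Because evaluation is deferred until `force`, only the two elements that survive the "take" are ever added; an eager evaluator would have performed five additions and a full reversal.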

Drag-along and beating relate to JIT 
compilation because APL is a very dy- 
namic language; types and attributes of 
data objects are not, in general, known 
until run-time. To fully realize these op- 
timizations’ potential, their application 
must be delayed until run-time informa- 
tion is available. 

Abrams’ “APL Machine” employed two separate JIT compilers. The
first translated APL programs into postfix code for a D-machine, 1
which maintained a buffer of deferred instructions. The D-machine
acted as an “algebraically simplifying compiler” [Abrams 1970,
p. 84] which would perform drag-along and beating at run-time,
invoking an E-machine to execute the buffered instructions when
necessary.

Abrams’ work was directed toward 
an architecture for efficient support of 
APL, hardware support for high-level lan- 
guages being a popular pursuit of the time. 
Abrams never built the machine, however; 
an implementation was attempted a few 
years later [Schroeder and Vaughn 1973]. 2 
The techniques were later expanded upon by others [Miller 1977],
although the basic JIT nature never changed, and were used for
the software implementation of Hewlett-Packard’s APL\3000
[Johnston 1977; van Dyke 1977].

2.4. Mixed Code, Throw-Away Code, and BASIC

The tradeoff between execution time and 
space often underlies the argument for JIT 
compilation. This tradeoff is summarized 
in Figure 1. The other consideration is 
that most programs spend the majority of 
time executing a minority of code, based on 
data from empirical studies [Knuth 1971]. 
Two ways to reconcile these observations 
have appeared: mixed code and throw- 
away compiling. 

Mixed code refers to the implementation of a program as a mixture
of native code and interpreted code, proposed independently by
Dakin and Poole [1973] and Dawson [1973]. The frequently executed
parts of the program would be in native code, the infrequently
executed parts interpreted, hopefully yielding a smaller memory
footprint with little or no impact on speed. A fine-grained
mixture is implied: implementing the program with interpreted code
and the libraries with native code would not constitute mixed
code.

1 Presumably D stood for Deferral or Drag-Along.

2 In the end, Litton Industries (Schroeder and Vaughn’s employer)
never built the machine [Mauriello 2000].

A further twist to the mixed code ap- 
proach involved customizing the inter- 
preter [Pittman 1987]. Instead of mixing 
native code into the program, the na- 
tive code manifests itself as special vir- 
tual machine instructions; the program is 
then compiled entirely into virtual ma- 
chine code. 

The basic idea of mixed code, switch- 
ing between different types of executable 
code, is still applicable to JIT systems, al- 
though few researchers at the time ad- 
vocated generating the machine code at 
run-time. Keeping both a compiler and an 
interpreter in memory at run-time may 
have been considered too costly on the ma- 
chines of the day, negating any program 
size tradeoff. 

The case against mixed code comes from 
software engineering [Brown 1976]. Even 
assuming that the majority of code will be 
shared between the interpreter and com- 
piler, there are still two disparate pieces 
of code (the interpreter proper and the 
compiler’s code generator) which must be 
maintained and exhibit identical behavior. 

(Proponents of partial evaluation, or 
program specialization, will note that this 
is a specious argument in some sense, be- 
cause a compiler can be thought of as a 
specialized interpreter [Jones et al. 1993]. 
However, the use of partial evaluation 
techniques is not currently widespread.) 

This brings us to the second man- 
ner of reconciliation: throw-away compil- 
ing [Brown 1976]. This was presented 
purely as a space optimization: instead 
of static compilation, parts of a program 
could be compiled dynamically on an as- 
needed basis. Upon exhausting memory, 
some or all of the compiled code could be 
thrown away; the code would be regener- 
ated later if necessary. 
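
A minimal sketch of the throw-away idea follows. The class name, the character-count "budget," and the use of Python bytecode as a stand-in for native code are all assumptions made for illustration; Brown's BASIC system is not described at this level of detail here.

```python
class ThrowAwayCompiler:
    """Compile routines on first use; discard all compiled code when space runs out."""

    def __init__(self, sources, budget):
        self.sources = sources      # routine name -> source text (an expression in x)
        self.budget = budget        # hypothetical space limit, in source characters
        self.compiled = {}          # routine name -> compiled routine
        self.used = 0
        self.compilations = 0

    def call(self, name, x):
        routine = self.compiled.get(name)
        if routine is None:                      # never compiled, or thrown away earlier
            routine = self._compile(name)
        return routine(x)

    def _compile(self, name):
        source = self.sources[name]
        cost = len(source)                       # stand-in for compiled-code size
        if self.used + cost > self.budget:
            self.compiled.clear()                # throw away everything compiled so far
            self.used = 0
        code = compile(source, name, "eval")     # Python bytecode stands in for native code

        def routine(x):
            return eval(code, {"x": x})

        self.compiled[name] = routine
        self.used += cost
        self.compilations += 1
        return routine


jit = ThrowAwayCompiler({"double": "x * 2", "square": "x * x", "inc": "x + 1"}, budget=10)
results = [jit.call("double", 3), jit.call("square", 3), jit.call("double", 4),
           jit.call("inc", 1), jit.call("double", 5)]
```

With room for only two routines, compiling "inc" discards the other two, so the final call to "double" triggers a recompilation. The sketch throws everything away at once; discarding only some of the code is equally consistent with the description above.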

BASIC was the testbed for throw-away compilation. Brown [1976]
essentially characterized the technique as a good way to address
the time-space tradeoff; Hammond [1977] was somewhat more adamant,
claiming throw-away compilation to be superior except when memory
is tight.

A good discussion of mixed code and 
throw-away compiling may be found 
in Brown [1990]. 

2.5. FORTRAN 

Some of the first work on JIT systems 
where programs automatically optimize 
their “hot spots” at run-time was due to 
Hansen [1974]. 3 He addressed three important
questions:

(1) What code should be optimized? 
Hansen chose a simple, low-cost 
frequency model, maintaining a 
frequency-of-execution counter for 
each block of code (we use the generic 
term block to describe a unit of 
code; the exact nature of a block is 
immaterial for our purposes). 

(2) When should the code be optimized? 
The frequency counters served a sec- 
ond role: crossing a threshold value 
made the associated block of code a 
candidate for the next “level” of op- 
timization, as described below. “Su- 
pervisor” code was invoked between 
blocks, which would assess the coun- 
ters, perform optimization if necessary, 
and transfer control to the next block 
of code. The latter operation could be a 
direct call, or interpreter invocation — 
mixed code was supported by Hansen’s 
design. 

(3) How should the code be optimized?
A set of conventional machine-independent and machine-dependent
optimizations were chosen and ordered, so a block might first be
optimized by constant folding, by common subexpression elimination
the second time optimization occurs, by code motion the third
time, and so on. Hansen [1974] observed that this scheme limits
the amount of time taken at any given optimization point
(especially important if the frequency model proves to be
incorrect), as well as allowing optimizations to be incrementally
added to the compiler.

3 Dawson [1973] mentioned a 1967 report by Barbieri and Morrissey
where a program begins execution in interpreted form, and
frequently executed parts “can be converted to machine code.”
However, it is not clear if the conversion to machine code
occurred at run-time. Unfortunately, we have not been able to
obtain the cited work as of this writing.

Programs using the resulting Adap- 
tive FORTRAN system reportedly were 
not always faster than their statically 
compiled-and-optimized counterparts, but 
performed better overall. 
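
Hansen's three answers (counters, thresholds, and an ordered list of optimizations applied by supervisor code between blocks) can be sketched as follows. The threshold values, the `Block` class, and the fact that an "optimization" is merely recorded rather than performed are assumptions made for illustration, not details of Adaptive FORTRAN.

```python
# Optimization levels applied in a fixed order, one per threshold crossing.
LEVELS = ["constant folding", "common subexpression elimination", "code motion"]
THRESHOLDS = [3, 6, 9]      # hypothetical execution counts that trigger each level


class Block:
    def __init__(self, name, body):
        self.name = name
        self.body = body        # callable(state) -> name of the next block, or None
        self.count = 0          # frequency-of-execution counter
        self.applied = []       # optimizations applied so far, in order


def supervisor(block, state):
    """Invoked between blocks: update the counter, optimize if due, run the block."""
    block.count += 1
    level = len(block.applied)
    if level < len(LEVELS) and block.count >= THRESHOLDS[level]:
        # At most one further level per visit, which bounds the pause.
        block.applied.append(LEVELS[level])
    return block.body(state)


def run(blocks, entry, state):
    name = entry
    while name is not None:
        name = supervisor(blocks[name], state)
    return state


def loop_body(state):
    state["i"] += 1
    return "loop" if state["i"] < 10 else "exit"


blocks = {"loop": Block("loop", loop_body),
          "exit": Block("exit", lambda state: None)}
run(blocks, "loop", {"i": 0})
```

After the run, the frequently executed "loop" block has passed through all three levels, while the "exit" block, executed once, was never optimized.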

Returning again to mixed code, Ng and 
Cantoni [1976] implemented a variant of 
FORTRAN using this technique. Their 
system could compile functions at run- 
time into “pseudo-instructions,” probably 
a tokenized form of the source code rather 
than a lower-level virtual machine code. 
The pseudo-instructions would then be 
interpreted. They claimed that run-time 
compilation was useful for some applica- 
tions and avoided a slow compile-link pro- 
cess. They did not produce mixed code 
at run-time; their use of the term re- 
ferred to the ability to have statically 
compiled FORTRAN programs call their 
pseudo-instruction interpreter automati- 
cally when needed via linker trickery. 

2.6. Smalltalk 

Smalltalk source code is compiled into vir- 
tual machine code when new methods are 
added to a class [Goldberg and Robson 
1985]. The performance of naive Smalltalk
implementations left something to be de- 
sired, however. 

Rather than attack the performance 
problem with hardware, Deutsch and 
Schiffman [1984] made key optimizations 
in software. The observation behind this 
was that they could pick the most efficient 
representation for information, so long as 
conversion between representations hap- 
pened automatically and transparently to 
the user. 

JIT conversion of virtual machine code to native code was one of
the optimization techniques they used, a process they likened to
macro-expansion. Procedures were compiled to native code lazily,
when execution entered the procedure; the native code was cached
for later use. Their system was linked to memory management in
that native code would never be paged out, just thrown away and
regenerated later if necessary.

In turn, Deutsch and Schiffman [1984] 
credited the dynamic translation idea to 
Rau [1978]. Rau was concerned with “uni- 
versal host machines” which would ex- 
ecute a variety of high-level languages 
well (compared to, say, a specialized APL 
machine). He proposed dynamic trans- 
lation to microcode at the granularity 
of single virtual machine instructions. 
A hardware cache, the dynamic transla- 
tion buffer, would store completed transla- 
tions; a cache miss would signify a missing 
translation, and fault to a dynamic trans- 
lation routine. 

2.7. Self 

The Self programming language [Ungar 
and Smith 1987; Smith and Ungar 1995], 
in contrast to many of the other lan- 
guages mentioned in this section, is pri- 
marily a research vehicle. Self is in many 
ways influenced by Smalltalk, in that 
both are pure object-oriented languages — 
everything is an object. But Self eschews 
classes in favor of prototypes, and oth- 
erwise attempts to unify a number of 
concepts. Every action is dynamic and 
changeable, and even basic operations, 
like local variable access, require invoca- 
tion of a method. To further complicate 
matters, Self is a dynamically-typed lan- 
guage, meaning that the types of identi- 
fiers are not known until run-time. 

Self’s unusual design makes efficient 
implementation difficult. This resulted in 
the development of the most aggressive, 
ambitious JIT compilation and optimiza- 
tion up to that time. The Self group 
noted three distinct generations of com- 
piler [Holzle 1994], an organization we fol- 
low below; in all cases, the compiler was 
invoked dynamically upon a method’s in- 
vocation, as in Deutsch and Schiffman’s 
[1984] Smalltalk system.

2.7.1. First Generation. Almost all the optimization techniques
employed by Self compilers dealt with type information, and
transforming a program in such a way that some certainty could be
had about the types of identifiers. Only a few techniques had a
direct relationship with JIT compilation, however.

Chief among these, in the first- 
generation Self compiler, was customiza- 
tion [Chambers et al. 1989; Chambers and 
Ungar 1989; Chambers 1992]. Instead 
of dynamically compiling a method into 
native code that would work for any 
invocation of the method, the compiler 
produced a version of the method that 
was customized to that particular con- 
text. Much more type information was 
available to the JIT compiler compared 
to static compilation, and by exploiting 
this fact the resulting code was much 
more efficient. While method calls from 
similar contexts could share customized 
code, “overcustomization” could still 
consume a lot of memory at run-time; 
ways to combat this problem were later 
studied [Dieckmann and Holzle 1997]. 
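
As a rough illustration of customization, the sketch below keeps one "compiled" version of a method per receiver type and resolves the method lookup once, when that version is created. The names are hypothetical, and binding a Python method stands in for the far more extensive type-specific optimization the Self compiler performed.

```python
stats = {"compiled": 0}     # number of customized versions generated


def customize(selector, receiver_type):
    """'Compile' a method for one receiver type by resolving the lookup up front."""
    stats["compiled"] += 1
    implementation = getattr(receiver_type, selector)   # looked up once, not per call

    def customized(receiver, *args):
        return implementation(receiver, *args)

    return customized


class CustomizingCompiler:
    def __init__(self):
        self.versions = {}      # (selector, receiver type) -> customized code

    def send(self, selector, receiver, *args):
        key = (selector, type(receiver))
        code = self.versions.get(key)
        if code is None:                       # first call from this context
            code = self.versions[key] = customize(selector, type(receiver))
        return code(receiver, *args)


jit = CustomizingCompiler()
outputs = [jit.send("__add__", 2, 3),          # customized for int
           jit.send("__add__", "ab", "cd"),    # customized for str
           jit.send("__add__", 40, 2)]         # reuses the int version
```

The `versions` table grows with every distinct receiver type seen, which hints at why overcustomization can consume a lot of memory.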

2.7.2. Second Generation. The second- 
generation Self compiler extended one 
of the program transformation tech- 
niques used by its predecessor, and 
computed much better type information 
for loops [Chambers and Ungar 1990; 
Chambers 1992]. 

This Self compiler’s output was indeed 
faster than that of the first generation, 
but it came at a price. The compiler ran 
15 to 35 times more slowly on bench- 
marks [Chambers and Ungar 1990, 1991], 
to the point where many users refused to 
use the new compiler [Holzle 1994]! 

Modifications were made to the responsible algorithms to speed up
compilation [Chambers and Ungar 1991]. One such modification was
called deferred compilation of uncommon cases. 4 The compiler is
informed that certain events, such as arithmetic overflow, are
unlikely to occur. That being the case, no code is generated for
these uncommon cases; a stub is left in the code instead, which
will invoke the compiler again if necessary. The practical result
of this is that the code for uncommon cases need not be analyzed
upon initial compilation, saving a substantial amount of time. 5

4 In Chambers’ thesis, this is referred to as “lazy compilation of
uncommon branches,” an idea he attributes to a suggestion by John
Maloney in 1989 [Chambers 1992, p. 123]. However, this is the same
technique used in Mitchell [1970], albeit for different reasons.

Ungar et al. [1992] gave a good presen- 
tation of optimization techniques used in 
Self and the resulting performance in the 
first- and second-generation compilers. 

2.7.3. Third Generation. The third- 
generation Self compiler attacked the 
issue of slow compilation at a much more 
fundamental level. The Self compiler 
was part of an interactive, graphical 
programming environment; executing the 
compiler on-the-fly resulted in a noticeable
pause in execution. Holzle argued that measuring
JIT compilation pauses by timing how long the
compiler took to run was
deceptive, and not representative of the
user’s experience [Holzle 1994; Holzle 
and Ungar 1994b]. Two invocations of the 
compiler could be separated by a brief 
spurt of program execution, but would 
be perceived as one long pause by the 
user. Holzle compensated by considering 
temporally related groups of pauses, or 
“pause clusters,” rather than individual 
compilation pauses. 
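
The idea of grouping temporally related pauses can be sketched with a simple gap rule. This is a deliberately simplified assumption: the function name, the `max_gap` parameter, and the merging criterion are hypothetical and are not Holzle's actual definition of a pause cluster.

```python
def pause_clusters(pauses, max_gap):
    """Group compilation pauses separated by at most max_gap of execution.

    pauses: (start, duration) pairs sorted by start time, in milliseconds.
    Returns (cluster_start, cluster_end) pairs.
    """
    clusters = []
    for start, duration in pauses:
        end = start + duration
        if clusters and start - clusters[-1][1] <= max_gap:
            # Close enough to the previous pause to be perceived as one.
            clusters[-1] = (clusters[-1][0], max(end, clusters[-1][1]))
        else:
            clusters.append((start, end))
    return clusters


pauses = [(0, 200), (250, 300), (5000, 100)]
clusters = pause_clusters(pauses, max_gap=500)
```

Here the first two compilations, separated by only 50 ms of program execution, are reported as a single 550 ms cluster, even though neither compilation alone took longer than 300 ms.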

As for the compiler itself, compi- 
lation time was reduced — or at least 
spread out — by using adaptive optimiza- 
tion, similar to Hansen’s [1974] FOR- 
TRAN work. Initial method compilation 
was performed by a fast, nonoptimizing 
compiler; frequency-of-invocation coun- 
ters were kept for each method to de- 
termine when recompilation should oc- 
cur [Holzle 1994; Holzle and Ungar 1994a, 
1994b]. Holzle makes an interesting com- 
ment on this mechanism: 

... in the course of our experiments we discovered that the
trigger mechanism (“when”) is much less important for good
recompilation results than the selection mechanism (“what”).
[Holzle 1994, p. 38] 6

5 This technique can be applied to dynamic compilation of
exception handling code [Lee et al. 2000].

This may come from the slightly coun- 
terintuitive notion that the best candi- 
date for recompilation is not necessarily 
the method whose counter triggered the 
recompilation. Object-oriented program- 
ming style tends to encourage short meth- 
ods; a better choice may be to (re)optimize 
the method’s caller and incorporate the 
frequently invoked method inline [Holzle 
and Ungar 1994b]. 

Adaptive optimization adds the compli- 
cation that a modified method may already 
be executing, and have information (such 
as an activation record on the stack) that 
depends on the previous version of the 
modified method [Holzle 1994]; this must 
be taken into consideration. 7 

The Self compiler’s JIT optimization was assisted by the
introduction of “type feedback” [Holzle 1994; Holzle and Ungar
1994a]. As a program executed, type information was gathered by
the run-time system, a straightforward process. This type
information would then be available if and when recompilation
occurred, permitting more aggressive optimization. Information
gleaned using type feedback was later shown to be comparable with,
and perhaps complementary to, information from static type
inference [Agesen and Holzle 1995; Agesen 1996].
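
Type feedback can be sketched as a call site that records the receiver types it sees and, on recompilation, produces code guarded by a test for the most common type. The `CallSite` class and its methods are hypothetical; a pre-resolved Python method stands in for the inlined code a real compiler would emit.

```python
from collections import Counter


class CallSite:
    """A message send that records receiver types until it is recompiled."""

    def __init__(self, selector):
        self.selector = selector
        self.seen = Counter()           # receiver type -> times observed

    def send(self, receiver, *args):
        self.seen[type(receiver)] += 1  # type feedback gathered by the run-time system
        return getattr(receiver, self.selector)(*args)

    def recompile(self):
        """Return code specialized for the most frequently observed receiver type."""
        if not self.seen:
            return self.send
        hot_type, _ = self.seen.most_common(1)[0]
        fast_path = getattr(hot_type, self.selector)    # resolved once

        def optimized(receiver, *args):
            if type(receiver) is hot_type:              # cheap guard on the common case
                return fast_path(receiver, *args)
            return getattr(receiver, self.selector)(*args)   # uncommon case: full lookup

        return optimized


class Shout:
    def upper(self):
        return "!"


site = CallSite("upper")
for receiver in ["a", "b", "c", Shout()]:
    site.send(receiver)
optimized = site.recompile()
```

Strings dominate the recorded history, so `optimized` takes the guarded fast path for them and falls back to an ordinary lookup for anything else.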


2.8. Slim Binaries and Oberon 

One problem with software distribution and maintenance is the
heterogeneous computing environment in which software runs:
different computer architectures require different binary
executables. Even within a single line of backward-compatible
processors, many variations in capability can exist; a program
statically compiled for the least-common denominator of processor
may not take full advantage of the processor on which it
eventually executes.

6 The same comment, with slightly different wording, also appears
in Holzle and Ungar [1994a, p. 328].

7 Hansen’s work in 1974 could ignore this possibility; the FORTRAN
of the time did not allow recursion, and so activation records and
a stack were unnecessary [Sebesta 1999].

In his doctoral work, Franz ad- 
dressed these problems using “slim 
binaries” [Franz 1994; Franz and Kistler 
1997]. A slim binary contains a high-level, 
machine-independent representation 8 
of a program module. When a module 
is loaded, executable code is generated 
for it on-the-fly, which can presumably 
tailor itself to the run-time environment. 
Franz, and later Kistler, claimed that 
generating code for an entire module at 
once was often superior to the method- 
at-a-time strategy used by Smalltalk 
and Self, in terms of the resulting code 
performance [Franz 1994; Kistler 1999]. 

Fast code generation was critical to the 
slim binary approach. Data structures 
were delicately arranged to facilitate this; 
generated code that could be reused was 
noted and copied if needed later, rather 
than being regenerated [Franz 1994].

Franz implemented slim binaries for 
the Oberon system, which allows dynamic 
loading of modules [Wirth and Gutknecht 
1989]. Loading and generating code for a 
slim binary was not faster than loading a 
traditional binary [Franz 1994; Franz and 
Kistler 1997], but Franz argued that this 
would eventually be the case as the speed 
discrepancy between processors and input/output
(I/O) devices increased [Franz 1994].

Using slim binaries as a starting point, 
Kistler’s [1999] work investigated “contin- 
uous” run-time optimization, where parts 
of an executing program can be optimized 
ad infinitum. He contrasted this to the 
adaptive optimization used in Self, where 
optimization of methods would eventually 
cease. 

Of course, reoptimization is only useful if a new, better,
solution can be obtained; this implies that continuous
optimization is best suited to optimizations whose input varies
over time with the program’s execution. 9 Accordingly, Kistler
looked at cache optimizations (rearranging fields in a structure
dynamically to optimize a program’s data-access patterns [Kistler
1999; Kistler and Franz 1999]) and a dynamic version of trace
scheduling, which optimizes based on information about a program’s
control flow during execution [Kistler 1999].

8 This representation is an abstract syntax tree, to be precise.

The continuous optimizer itself executes 
in the background, as a separate low- 
priority thread which executes only dur- 
ing a program’s idle time [Kistler 1997, 
1999]. Kistler used a more sophisticated 
metric than straightforward counters to 
determine when to optimize, and observed 
that deciding what to optimize is highly 
optimization-specific [Kistler 1999]. 

An idea similar to continuous optimiza- 
tion has been implemented for Scheme. 
Burger [1997] dynamically reordered code 
blocks using profile information, to im- 
prove code locality and hardware branch 
prediction. His scheme relied on the (copy- 
ing) garbage collector to locate pointers 
to old versions of a function, and update 
them to point to the newer version. This 
dynamic recompilation process could be 
repeated any number of times [Burger 
1997, p. 70].

2.9. Templates, ML, and C 

ML and C make strange bedfellows, but 
the same approach has been taken to dy- 
namic compilation in both. This approach 
is called staged compilation, where compi- 
lation of a single program is divided into 
two stages: static and dynamic compila- 
tion. Prior to run-time, a static compiler 
compiles “templates,” essentially building 
blocks which are pieced together at run- 
time by the dynamic compiler, which may 
also place run-time values into holes left in 
the templates. Typically these templates 
are specified by user annotations, although some work has been
done on deriving them automatically [Mock et al. 1999].
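
The two stages can be sketched as follows, with heavy simplification: the template table, the `{k}` hole syntax, and the use of Python source text in place of precompiled machine-code fragments are all assumptions made for illustration.

```python
# Static stage: templates prepared before run-time; {k} marks a hole.
TEMPLATES = {
    "scale": "acc = acc * {k}",
    "offset": "acc = acc + {k}",
}


def dynamic_compile(plan):
    """Dynamic stage: stitch templates together and fill holes with run-time values.

    plan: list of (template name, run-time value) pairs.
    """
    lines = ["def generated(acc):"]
    for name, value in plan:
        lines.append("    " + TEMPLATES[name].format(k=repr(value)))
    lines.append("    return acc")
    namespace = {}
    exec("\n".join(lines), namespace)   # Python source stands in for machine code
    return namespace["generated"]


transform = dynamic_compile([("scale", 3), ("offset", 4)])
```

Calling `transform(5)` multiplies by 3 and then adds 4; the constants were known only at run-time, but the shape of each step was fixed ahead of time.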


9 Although, making the general case for run-time optimization, he
discussed intermodule optimizations where this is not the case
[Kistler 1997].


As just described, template-based sys- 
tems arguably do not fit our description of 
JIT compilers, since there would appear to 
be no nontrivial translation aspect. How- 
ever, templates may be encoded in a form 
which requires run-time translation be- 
fore execution, or the dynamic compiler 
may perform run-time optimizations after 
connecting the templates. 

Templates have been applied to (sub- 
sets of) ML [Leone and Lee 1994; Lee 
and Leone 1996; Wickline et al. 1998]. 
They have also been used for run-time spe- 
cialization of C [Consel and Noel 1996; 
Marlet et al. 1999], as well as dynamic 
extensions of C [Auslander et al. 1996; 
Engler et al. 1996; Poletto et al. 1997]. 
One system, Dynamo, 10 proposed to per- 
form staged compilation and dynamic op- 
timization for Scheme and Java, as well as 
for ML [Leone and Dybvig 1997]. 

Templates aside, ML may be dynami- 
cally compiled anyway. In Cardelli’s de- 
scription of his ML compiler, he noted: 

[Compilation] is repeated for every definition or 
expression typed by the user. . . or fetched from 
an external file. Because of the interactive use 
of the compiler, the compilation of small phrases 
must be virtually instantaneous. [Cardelli 1984, 
p. 209] 

2.10. Erlang 

Erlang is a functional language, designed 
for use in large, soft real-time systems 
such as telecommunications equipment 
[Armstrong 1997]. Johansson et al. [2000] 
described the implementation of a JIT 
compiler for Erlang, HiPE, designed to ad- 
dress performance problems. 

As a recently designed system without 
historical baggage, HiPE stands out in 
that the user must explicitly invoke the 
JIT compiler. The rationale for this is that 
it gives the user a fine degree of control 
over the performance/code space tradeoff 
that mixed code offers [Johansson et al. 
2000].

HiPE exercises considerable care when performing “mode-switches”
back and forth between native and interpreted code. Mode-switches
may be needed at the obvious locations (calls and returns) as well
as for thrown exceptions. Their calls use the mode of the caller
rather than the mode of the called code; this is in contrast to
techniques used for mixed code in Lisp (Gabriel and Masinter
[1985] discussed mixed code calls in Lisp and their performance
implications).

10 A name collision: Leone and Dybvig’s “Dynamo” is different from
the “Dynamo” of Bala et al. [1999].

2.11. Specialization and O’Caml 

O’Caml is another functional language, 
and can be considered a dialect of ML 
[Remy et al. 1999]. The O’Caml inter- 
preter has been the focus of run-time spe- 
cialization work. 

Piumarta and Riccardi [1998] special- 
ized the interpreter’s instructions to the 
program being run, in a limited way. 11 
They first dynamically translated inter- 
preted bytecodes into direct threaded 
code [Bell 1973], then dynamically com- 
bined blocks of instructions together into 
new “macro opcodes,” modifying the code 
to use the new instructions. This reduced 
the overhead of instruction dispatch, and 
yielded opportunities for optimization in 
macro opcodes which would not have been 
possible if the instructions had been sepa- 
rate (although they did not perform such 
optimizations). As presented, their tech- 
nique did not take dynamic execution 
paths into account, and they noted that it 
is best suited to low-level instruction sets, 
where dispatch time is a relatively large 
factor in performance. 
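
The two translation steps can be sketched on a toy stack machine. The opcode set, handler functions, and dispatch counter are hypothetical; in particular, the loop inside the macro opcode merely stands in for handler bodies laid end to end, which Python cannot express directly.

```python
def push(stack, n):
    stack.append(n)


def add(stack, _):
    b, a = stack.pop(), stack.pop()
    stack.append(a + b)


def mul(stack, _):
    b, a = stack.pop(), stack.pop()
    stack.append(a * b)


HANDLERS = {"PUSH": push, "ADD": add, "MUL": mul}


def thread(bytecode):
    """Replace each opcode name with a direct reference to its handler."""
    return [(HANDLERS[op], arg) for op, arg in bytecode]


def fuse(threaded):
    """Combine a straight-line block into a single macro opcode."""
    block = tuple(threaded)

    def macro(stack, _):
        # Stands in for the handlers' bodies laid end to end.
        for handler, arg in block:
            handler(stack, arg)

    return [(macro, None)]


def run(threaded):
    """Dispatch loop; returns (result, number of dispatches)."""
    stack, dispatches = [], 0
    for handler, arg in threaded:
        dispatches += 1
        handler(stack, arg)
    return stack.pop(), dispatches


bytecode = [("PUSH", 2), ("PUSH", 3), ("ADD", None), ("PUSH", 4), ("MUL", None)]
threaded = thread(bytecode)
plain = run(threaded)           # one dispatch per instruction
fused = run(fuse(threaded))     # one dispatch for the whole block
```

Both runs compute (2 + 3) * 4, but the fused version passes through the dispatch loop once instead of five times.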

A more general approach to run-time 
specialization was taken by Thibault et al. 
[2000]. They applied their program specializer, Tempo [Consel et
al. 1998], to the Java virtual machine and the O’Caml interpreter
at run-time. They noted:

While the speedup obtained by specialization is significant, it
does not compete with results obtained with hand-written off-line
or run-time compilers. [Thibault et al. 2000, p. 170]


11 Thibault et al. [2000] provided an alternative view 
on Piumarta and Riccardi’s work with respect to 
specialization. 


But later in the paper they stated that 

. . . program specialization is entering relative maturity.
[Thibault et al. 2000, p. 175]

This may be taken to imply that, at least 
for the time being, program specialization 
may not be as fruitful as other approaches 
to dynamic compilation and optimization. 

2.12. Prolog 

Prolog systems dynamically compile, too, 
although the execution model of Pro- 
log necessitates use of specialized tech- 
niques. Van Roy [1994] gave an outstand- 
ing, detailed survey of the area. One of 
SICStus Prolog’s native code compilers, 
which could be invoked and have its out- 
put loaded dynamically, was described in 
Haygood [1994]. 

2.13. Simulation, Binary Translation, 
and Machine Code 

Simulation is the process of running na- 
tive executable machine code for one ar- 
chitecture on another architecture. 12 How 
does this relate to JIT compilation? One 
of the techniques for simulation is bi- 
nary translation; in particular, we focus on 
dynamic binary translation that involves 
translating from one machine code to an- 
other at run-time. Typically, binary trans- 
lators are highly specialized with respect 
to source and target; research on retar- 
getable and “resourceable” binary trans- 
lators is still in its infancy [Ung and 
Cifuentes 2000]. Altman et al. [2000b] 
have a good discussion of the challenges 
involved in binary translation, and Cmelik 
and Keppel [1994] compared pre-1995 
simulation systems in detail. Rather than 
duplicating their work, we will take a 
higher-level view. 

May [1987] proposed that simulators 
could be categorized by their implementa- 
tion technique into three generations. To 


12 We use the term simulate in preference to emulate 
as the latter has the connotation that hardware is 
heavily involved in the process. However, some liter- 
ature uses the words interchangeably. 
this, we add a fourth generation to characterize
more recent work.

(1) First-generation simulators were 
interpreters, which would simply 
interpret each source instruction as 
needed. As might be expected, these 
tended to exhibit poor performance 
due to interpretation overhead. 

(2) Second-generation simulators dynamically
translated source instructions
into target instructions one at a time,
caching the translations for later use.

(3) Third-generation simulators impro- 
ved upon the performance of second- 
generation simulators by dynamically 
translating entire blocks of source in- 
structions at a time. This introduces 
new questions as to what should be 
translated. Most such systems trans- 
lated either basic blocks of code or 
extended basic blocks [Cmelik and 
Keppel 1994], reflecting the static 
control flow of the source program. 
Other static translation units are pos- 
sible: one anomalous system, DAISY, 
performed page-at-a-time translations 
from PowerPC to VLIW instructions 
[Ebcioglu and Altman 1996, 1997]. 

(4) What we call fourth-generation 
simulators expand upon the third- 
generation by dynamically translating 
paths, or traces. A path reflects the 
control flow exhibited by the source 
program at run-time, a dynamic in- 
stead of a static unit of translation. 
The most recent work on binary trans- 
lation is concentrated on this type of 
system. 
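
The idea of translating a block at a time and caching the result can be sketched as follows. The three-instruction source "architecture", the choice of Python source text as the translation target, and all function names are assumptions made for illustration; no surveyed simulator is being reproduced.

```python
# Hypothetical toy source ISA:
#   ("addi", reg, imm)   regs[reg] += imm
#   ("jmp", target)      unconditional branch
#   ("bnez", reg, tgt)   branch to tgt if regs[reg] != 0
#   ("halt",)            stop

def translate(code, start):
    """Translate the basic block beginning at `start` into a Python function.

    The generated function updates `regs` and returns the next address
    to execute, or None when the program halts.
    """
    lines = ["def block(regs):"]
    pc = start
    while True:
        insn = code[pc]
        op = insn[0]
        pc += 1
        if op == "addi":
            lines.append(f"    regs[{insn[1]!r}] += {insn[2]}")
        elif op == "jmp":
            lines.append(f"    return {insn[1]}")
            break
        elif op == "bnez":
            lines.append(f"    return {insn[2]} if regs[{insn[1]!r}] != 0 else {pc}")
            break
        elif op == "halt":
            lines.append("    return None")
            break
        else:
            raise ValueError(f"unknown opcode {op!r}")
    namespace = {}
    exec("\n".join(lines), namespace)
    return namespace["block"]

def simulate(code, regs):
    """Run `code`, translating each block once; return the number of translations."""
    cache = {}
    translations = 0
    pc = 0
    while pc is not None:
        if pc not in cache:          # translate on first use only
            cache[pc] = translate(code, pc)
            translations += 1
        pc = cache[pc](regs)         # later visits reuse the cached translation
    return translations

code = [
    ("addi", "n", 3),     # 0
    ("jmp", 2),           # 1
    ("addi", "acc", 10),  # 2: loop body
    ("addi", "n", -1),    # 3
    ("bnez", "n", 2),     # 4: repeat while n != 0
    ("halt",),            # 5
]
regs = {"n": 0, "acc": 0}
translated_blocks = simulate(code, regs)
```

The loop body runs three times but is translated once. A second-generation design would instead key the cache on individual instructions, and a fourth-generation design on paths observed at run-time.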

Fourth-generation simulators are pre- 
dominant in recent literature [Bala et al. 
1999; Chen et al. 2000; Deaver et al. 1999; 
Gschwind et al. 2000; Klaiber 2000; Zheng 
and Thompson 2000]. The structure of 
these is fairly similar: 

(1) Profiled execution. The simulator’s 
effort should be concentrated on “hot” 
areas of code that are frequently exe- 
cuted. For example, initialization code 
that is executed only once should not 
be translated or optimized. To determine
which execution paths are hot,
the source program is executed in some 
manner and profile information is 
gathered. Time invested in doing this 
is assumed to be recouped eventually. 

When source and target architectures
are dissimilar, or the source architecture
is uncomplicated (such as a reduced
instruction set computer (RISC) processor),
the source program is typically executed
by interpretation [Bala et al. 1999; Gschwind et al.
2000; Transmeta Corporation 2001; 
Zheng and Thompson 2000]. The al- 
ternative approach, direct execution, is 
best summed up by Rosenblum et al. 
[1995, p. 36]: 

By far the fastest simulator of the CPU,
MMU, and memory system of an SGI multiprocessor
is an SGI multiprocessor.

In other words, when the source and 
target architectures are the same, as 
in the case where the goal is dynamic 
optimization of a source program, the 
source program can be executed di- 
rectly by the central processing unit 
(CPU). The simulator regains control 
periodically as a result of appropri- 
ately modifying the source program 
[Chen et al. 2000] or by less direct
means such as interrupts [Gorton 2001].

(2) Hot path detection. In lieu of hard- 
ware support, hot paths may be de- 
tected by keeping counters to record 
frequency of execution [Zheng and 
Thompson 2000], or by watching for 
code that is structurally likely to be 
hot, like the target of a backward 
branch [Bala et al. 1999]. With hardware
support, the program counter
can be sampled at intervals to
detect hot spots [Deaver et al. 1999].

Some other considerations are that
paths may be strategically excluded if
they are too expensive or difficult to
translate [Zheng and Thompson 2000],
and choosing good stopping points for
paths can be as important as choosing
good starting points in terms of keeping
a manageable number of traces
[Gschwind et al. 2000].

(3) Code generation and optimization. 
Once a hot path has been noted, the 
simulator will translate it into code 
for the target architecture, or perhaps 
optimize the code. The correctness of 
the translation is always at issue, and 
some empirical verification techniques 
are discussed in [Zheng and Thompson
2000].

(4) “Bail-out” mechanism. In the case of 
dynamic optimization systems (where 
the source and target architectures are 
the same), there is the potential for 
a negative impact on the source pro- 
gram’s performance. A bail-out mech- 
anism [Bala et al. 1999] heuristically 
tries to detect such a problem and revert
to the source program’s direct
execution; such a problem can be spotted, for
example, by monitoring the stability of 
the working set of paths. Such a mech- 
anism can also be used to avoid han- 
dling complicated cases. 
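
Software-only hot path detection, as in steps (1) and (2) above, can be sketched with an interpreter that counts how often the target of a taken backward branch is reached and, once a threshold is crossed, records the path executed from that target. Combining the counter and backward-branch heuristics in this way, the threshold value, and the toy instruction set are all assumptions of this example; the cited systems differ in their details, and the sketch stops short of translating the recorded path.

```python
HOT_THRESHOLD = 2  # assumed value, chosen only to keep the example short

def run_with_profiling(code, regs):
    """Interpret `code`; return recorded hot paths keyed by their head address."""
    counters = {}     # backward-branch target -> times reached via that branch
    traces = {}       # trace head -> recorded path (list of instruction addresses)
    recording = None  # (head, path) while a path is being recorded
    pc = 0
    while True:
        op, *args = code[pc]
        if recording is not None:
            recording[1].append(pc)
        if op == "addi":
            regs[args[0]] += args[1]
            next_pc = pc + 1
        elif op == "bnez":
            next_pc = args[1] if regs[args[0]] != 0 else pc + 1
        elif op == "halt":
            return traces
        else:
            raise ValueError(f"unknown opcode {op!r}")
        if next_pc <= pc:                      # a backward branch was taken
            if recording is not None:
                if recording[0] == next_pc:    # path closed on its own head
                    traces[next_pc] = recording[1]
                recording = None               # otherwise abandon the path
            elif next_pc not in traces:
                counters[next_pc] = counters.get(next_pc, 0) + 1
                if counters[next_pc] >= HOT_THRESHOLD:
                    recording = (next_pc, [])
        pc = next_pc

loop = [
    ("addi", "n", 5),    # 0: executed once, never becomes hot
    ("addi", "acc", 2),  # 1: loop head
    ("addi", "n", -1),   # 2
    ("bnez", "n", 1),    # 3: backward branch to 1
    ("halt",),           # 4
]
loop_regs = {"n": 0, "acc": 0}
hot_paths = run_with_profiling(loop, loop_regs)
```

The initialization instruction at address 0 is never recorded, while the loop body is captured as a single path once its head has been reached twice by the backward branch.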

Another recurring theme in recent 
binary translation work is the issue of 
hardware support for binary translation, 
especially for translating code for legacy 
architectures into VLIW code. This has 
attracted interest because VLIW archi- 
tectures promise legacy architecture 
implementations which have higher per- 
formance, greater instruction-level paral- 
lelism [Ebcioglu and Altman 1996, 1997], 
higher clock rates [Altman et al. 2000a; 
Gschwind et al. 2000], and lower power 
requirements [Klaiber 2000]. Binary 
translation work in these processors is 
still done by software at run-time, and is 
thus still dynamic binary translation, al- 
though occasionally packaged under more 
fanciful names to enrapture venture capi- 
talists [Geppert and Perry 2000]. The key 
idea in these systems is that, for efficiency, 
the target VLIW should provide a super- 
set of the source architecture [Ebcioglu 
and Altman 1997]; these extra resources, 
unseen by the source program, can be used 
by the binary translator for aggressive 
optimizations or to simulate troublesome 
aspects of the source architecture. 


2.14. Java 

Java is implemented by static compila- 
tion to bytecode instructions for the Java 
virtual machine, or JVM. Early JVMs 
were only interpreters, resulting in less- 
than-stellar performance: 

Interpreting bytecodes is slow. [Cramer et al. 1997, p. 37]

Java isn’t just slow, it’s really slow, surprisingly slow. [Tyma 1998, p. 41]

Regardless of how vitriolic the expres- 
sion, the message was that Java programs 
had to run faster, and the primary means 
looked to for accomplishing this was JIT 
compilation of Java bytecodes. Indeed, 
Java brought the term just-in-time into 
common use in computing literature. 13 
Unquestionably, the pressure for fast Java 
implementations spurred a renaissance in 
JIT research; at no other time in history 
has such concentrated time and money 
been invested in it. 

An early view of Java JIT compilation 
was given by Cramer et al. [1997], who 
were engineers at Sun Microsystems, the 
progenitor of Java. They made the ob- 
servation that there is an upper bound 
on the speedup achievable by JIT compi- 
lation, noting that interpretation proper 
only accounted for 68% of execution time 
in a profile they ran. They also advocated 
the direct use of JVM bytecodes, a stack- 
based instruction set, as an intermedi- 
ate representation for JIT compilation and 
optimization. In retrospect, this is a mi- 
nority viewpoint; most later work, includ- 
ing Sun’s own [Sun Microsystems 2001], 
invariably began by converting JVM 
code into a register-based intermediate 
representation. 
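
To make the contrast between the two representations concrete, the sketch below converts straight-line stack code into three-address form by simulating the operand stack with names instead of values. The opcode names are loosely modeled on JVM mnemonics but simplified, and the tuple format `(dest, op, left, right)` is an assumption of this example; it does not depict any particular JIT compiler’s intermediate representation, and it ignores control flow entirely.

```python
def stack_to_registers(bytecode):
    """Convert straight-line stack code into (dest, op, left, right) tuples."""
    stack = []      # holds operand *names*, not run-time values
    out = []
    temp_count = 0
    for insn in bytecode:
        op = insn[0]
        if op == "iload":                 # push a local variable
            stack.append(f"local{insn[1]}")
        elif op == "iconst":              # push an integer constant
            stack.append(str(insn[1]))
        elif op in ("iadd", "imul"):      # pop two operands, push the result
            right, left = stack.pop(), stack.pop()
            temp_count += 1
            dest = f"t{temp_count}"       # fresh virtual register
            out.append((dest, op[1:], left, right))
            stack.append(dest)
        elif op == "istore":              # pop into a local variable
            out.append((f"local{insn[1]}", "mov", stack.pop(), None))
        else:
            raise ValueError(f"unsupported opcode {op!r}")
    return out

# local3 = (local1 + local2) * 2
stack_code = [
    ("iload", 1), ("iload", 2), ("iadd",),
    ("iconst", 2), ("imul",), ("istore", 3),
]
register_code = stack_to_registers(stack_code)
```

Once operands are named explicitly, as in `register_code`, the data dependences between instructions are visible without replaying the stack.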

The interesting trend in Java JIT 
work [Adl-Tabatabai et al. 1998; Bik et al. 
1999; Burke et al. 1999; Cierniak and 
Li 1997; Ishizaki et al. 1999; Krall and 
Grafl 1997; Krall 1998; Yang et al. 1999] 
is the implicit assumption that mere 


13 Gosling [2001] pointed out that the term just- 
in-time was borrowed from manufacturing terminol- 
ogy, and traced his own use of the term back to about 
1993. 


ACM Computing Surveys, Vol. 35, No. 2, June 2003. 





translation from bytecode to native code is 
not enough: code optimization is necessary 
too. At the same time, this work recognizes 
that traditional optimization techniques 
are expensive, and looks for modifica- 
tions to optimization algorithms that 
strike a balance between speed of algo- 
rithm execution and speed of the resulting 
code. 

There have also been approaches to 
Java JIT compilation besides the usual 
interpret-first-optimize-later. A compile- 
only strategy, with no interpreter whatso- 
ever, was adopted by Burke et al. [1999], 
who also implemented their system in 
Java; improvements to their JIT directly 
benefited their system. Agesen [1997] 
translated JVM bytecodes into Self code, 
to leverage optimizations already exist- 
ing in the Self compiler. Annotations were 
tried by Azevedo et al. [1999] to shift the 
effort of code optimization prior to run- 
time: information needed for efficient JIT 
optimization was precomputed and tagged 
on to bytecode as annotations, which were 
then used by the JIT system to assist its 
work. Finally, Plezbert and Cytron [1997] 
proposed and evaluated the idea of “con- 
tinuous compilation” for Java in which 
an interpreter and compiler would exe- 
cute concurrently, preferably on separate 
processors. 14 

3. CLASSIFICATION OF JIT SYSTEMS 

In the course of surveying JIT work, some 
common attributes emerged. We propose 
that JIT systems can be classified accord- 
ing to three properties: 

(1) Invocation. A JIT compiler is explic- 
itly invoked if the user must take some 
action to cause compilation at run- 
time. An implicitly invoked JIT com- 
piler is transparent to the user. 

(2) Executability. JIT systems typically
involve two languages: a source lan- 
guage to translate from, and a tar- 
get language to translate to (although 


14 As opposed to the ongoing optimization of Kistler’s 
[2001] “continuous optimization,” only compilation 
occurred concurrently using “continuous compila- 
tion,” and only happened once. 


these languages can be the same, if 
the JIT system is only performing op- 
timization on-the-fly). We call a JIT 
system monoexecutable if it can only 
execute one of these languages, and 
polyexecutable if it can execute more
than one. Polyexecutable JIT systems 
have the luxury of deciding when com- 
piler invocation is warranted, since ei- 
ther program representation can be 
used. 

(3) Concurrency. This property charac- 
terizes how the JIT compiler executes, 
relative to the program itself. If pro- 
gram execution pauses under its own 
volition to permit compilation, it is not 
concurrent; the JIT compiler in this 
case may be invoked via subroutine 
call, message transmission, or transfer 
of control to a coroutine. In contrast, a 
concurrent JIT compiler can operate
while the program executes: in
a separate thread or process, even on a 
different processor. 
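
The three properties can be seen together in a small sketch of a polyexecutable system with a concurrent compiler. The `Function` class, the use of Python’s `eval` as a stand-in interpreter, and the `compile_in_background` helper are hypothetical and exist only for this example; in an implicitly invoked system the run-time itself, not the user’s code, would decide when to start the compiler.

```python
import threading

class Function:
    """A function that starts out interpreted and may later gain compiled code."""

    def __init__(self, source):
        self.source = source     # an expression in x, e.g. "x * x + 1"
        self.compiled = None     # set by the compiler thread when ready
        self.calls = {"interpreted": 0, "compiled": 0}

    def __call__(self, x):
        code = self.compiled     # either representation can be executed
        if code is None:
            self.calls["interpreted"] += 1
            return eval(self.source, {}, {"x": x})  # stand-in for an interpreter
        self.calls["compiled"] += 1
        return code(x)

def jit_compile(fn):
    """Build a native Python function from the source and publish it."""
    namespace = {}
    exec(f"def compiled(x):\n    return {fn.source}", namespace)
    fn.compiled = namespace["compiled"]

def compile_in_background(fn):
    """Start compilation in a separate thread; the caller keeps running."""
    worker = threading.Thread(target=jit_compile, args=(fn,))
    worker.start()
    return worker

square_plus_one = Function("x * x + 1")
first = square_plus_one(3)                 # interpreted: no compiled code yet
worker = compile_in_background(square_plus_one)
worker.join()                              # only to make the example deterministic
second = square_plus_one(4)                # now served by the compiled version
```

Because the system can execute either representation, the program never has to wait for the compiler; a monoexecutable design would have to block on the first call instead.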

JIT systems that function in hard real 
time may constitute a fourth classifying 
property, but there seems to be little re- 
search in the area at present; it is un- 
clear if hard real-time constraints pose 
any unique problems to JIT systems. 

Some trends are apparent. For instance, 
implicitly invoked JIT compilers are defi- 
nitely predominant in recent work. Exe- 
cutability varies from system to system, 
but this is more an issue of design than 
an issue of JIT technology. Work on con- 
current JIT compilers is currently only be- 
ginning, and will likely increase in impor- 
tance as processor technology evolves. 

4. TOOLS FOR JIT COMPILATION 

General, portable tools for JIT compilation 
that help with the dynamic generation of 
binary code did not appear until relatively 
recently. To varying degrees, these toolkits 
address three issues: 

(1) Binary code generation. As argued
in Ramsey and Fernandez [1995],
emitting binary code such as machine
language is a situation rife with
opportunities for error. There are associated
bookkeeping tasks too: information
may not yet be available upon initial
code generation, like the location of
forward branch targets. Once discovered,
the information must be backpatched
into the appropriate locations.

(2) Cache coherence. CPU speed advances
have far outstripped memory
speed advances in recent years
[Hennessy and Patterson 1996]. To
compensate, modern CPUs incorporate
a small, fast cache memory, the
contents of which may get temporarily
out of sync with main memory.
When dynamically generating code,
care must be taken to ensure that the
cache contents reflect code written to
main memory before execution is attempted.
The situation is even more
complicated when several CPUs share
a single memory. Keppel [1991] gave a
detailed discussion.

(3) Execution. The hardware or operating
system may impose restrictions
which limit where executable code
may reside. For example, memory
earmarked for data may not allow
execution (i.e., instruction fetches) by
default, meaning that code could be
generated into the data memory, but
not executed without platform-specific
wrangling. Again, refer to Keppel
[1991].

Table I. Comparison of JIT Toolkits

Source                        Binary code  Cache      Execution  Abstract   Input
                              generation   coherence             interface
Engler [1996]                 •            •          •          •          ad hoc
Engler and Proebsting [1994]  •            •          •          •          tree
Fraser and Proebsting [1999]  •            •          •          •          postfix
Keppel [1991]                              •          •          •          n/a
Ramsey and Fernandez [1995]   •                                             ad hoc

Note: n/a = not applicable.

Only the first issue is relevant for JIT 
compilation to interpreted virtual ma- 
chine code — interpreters don’t directly ex- 
ecute the code they interpret — but there is 
no reason why JIT compilation tools can- 
not be useful for generation of nonnative 
code as well. 
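
The backpatching bookkeeping mentioned under binary code generation can be sketched with a small emitter that leaves a placeholder for each forward branch and fills it in once the target is known. The `Emitter` class, its method names, and the symbolic instruction format are assumptions of this example; none of the toolkits compared here is being depicted, and a real emitter would write encoded machine instructions, not Python lists.

```python
class Emitter:
    """Collects instructions and resolves forward branch targets afterwards."""

    def __init__(self):
        self.code = []     # emitted instructions as mutable [op, operand] lists
        self.labels = {}   # label name -> address of the next instruction
        self.fixups = []   # (instruction index, label name) awaiting a target

    def emit(self, op, operand=None):
        self.code.append([op, operand])

    def label(self, name):
        self.labels[name] = len(self.code)

    def jump(self, op, label):
        if label in self.labels:                    # backward branch: target known
            self.emit(op, self.labels[label])
        else:                                       # forward branch: patch later
            self.fixups.append((len(self.code), label))
            self.emit(op, None)

    def finish(self):
        """Backpatch every pending branch, then return the finished code."""
        for index, label in self.fixups:
            self.code[index][1] = self.labels[label]  # KeyError if never defined
        self.fixups.clear()
        return [tuple(insn) for insn in self.code]

emitter = Emitter()
emitter.jump("jz", "done")   # target address is not yet known
emitter.emit("inc", "r0")
emitter.label("done")
emitter.emit("ret")
patched = emitter.finish()
```

After `finish`, the placeholder operand of the first instruction holds address 2, the location that the label `done` turned out to have.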


Table I gives a comparison of the 
toolkits. In addition to indicating how 
well the toolkits support the three areas 
above, we have added two extra cate- 
gories. First, an abstract interface is one 
that is architecture-independent. Use of 
a toolkit’s abstract interface implies that 
very little, if any, of the user’s code 
needs modification in order to use a 
new platform. The drawbacks are that 
architecture-dependent operations like 
register allocation may be difficult, and 
the mapping from abstract to actual ma- 
chine may be suboptimal, such as a map- 
ping from RISC abstraction to complex in- 
struction set computer (CISC) machinery. 

Second, input refers to the structure, if 
any, of the input expected by the toolkit. 
With respect to JIT compilation, more 
complicated input structures take more 
time and space for the user to produce and 
the toolkit to consume [Engler 1996]. 

Using a tool may solve some prob- 
lems but introduce others. Tools for bi- 
nary code generation help avoid many 
errors compared to manually emitting bi- 
nary code. These tools, however, require 
detailed knowledge of binary instruction 
formats whose specification may itself be 
prone to error. Engler and Hsieh [2000] 
presented a “metatool” that can automat- 
ically derive these instruction encodings 
by repeatedly querying the existing sys- 
tem assembler with varying inputs. 

5. CONCLUSION 

Dynamic, or just-in-time, compilation is 
an old implementation technique with 
a fragmented history. By collecting this 
historical information together, we hope to 
shorten the voyage of rediscovery. 